Automated phenotyping of patients with non-alcoholic fatty liver disease reveals clinically relevant disease subtypes

Non-alcoholic fatty liver disease (NAFLD) is a complex heterogeneous disease which affects more than 20% of the population worldwide. Some subtypes of NAFLD have been clinically identified using hypothesis-driven methods. In this study, we used data mining techniques to search for subtypes in an unbiased fashion. Using electronic signatures of the disease, we identified a cohort of 13,290 patients with NAFLD from a hospital database. We gathered clinical data from multiple sources and applied unsupervised clustering to identify five subtypes among this cohort. Descriptive statistics and survival analysis showed that the subtypes were clinically distinct and were associated with different rates of death, cirrhosis, hepatocellular carcinoma, chronic kidney disease, cardiovascular disease, and myocardial infarction. Novel disease subtypes identified in this manner could be used to risk-stratify patients and guide management.


Introduction
Non-alcoholic fatty liver disease (NAFLD) is estimated to affect 25% of the global population. 1 NAFLD is a chronic liver disease associated with the metabolic syndrome that can progress to cirrhosis and hepatocellular carcinoma (HCC). In the United States, NAFLD-related liver failure has become the second most common indication for liver transplants, after chronic hepatitis C. 2,3 This trend is expected to continue, with NAFLD prevalence rising to 33.5% of the adult US population by 2030, and driving increases in both cirrhosis and HCC. 4
NAFLD is a heterogeneous disease which has been associated with a variety of adverse outcomes. Besides cirrhosis and HCC, NAFLD has also been associated with cardiovascular disease (CVD) 5,6 and chronic kidney disease (CKD). 7 In some cohorts, CVD is the leading cause of death among NAFLD patients, followed by malignancy and liver-related mortality. 8-10
Some NAFLD subtypes and prognostic factors have been identified. Patients with both steatosis and inflammation (i.e. non-alcoholic steatohepatitis, NASH) have worse outcomes than those with bland steatosis. 11,12 Similarly, patients with NAFLD-associated cirrhosis have worse outcomes than those without cirrhosis. 8 Interestingly, although cirrhosis strongly predicts HCC, some NAFLD patients develop HCC in the absence of cirrhosis. 13 Hispanic populations tend to have higher rates of NAFLD; 14 a variant in PNPLA3 associated with hepatic steatosis and NASH has been identified and is more common among Hispanic individuals. 15
Given the clinical variability among NAFLD patients, we hypothesized that there may be clinically relevant patient subtypes which could be identified using unbiased machine learning algorithms. The identification of such subtypes could enable more precise prognostication and management for NAFLD patients.

NAFLD definition
To define NAFLD, we developed an algorithm based on two published electronic medical record (EMR)-based algorithms. 16,17 First, we identified patients with liver disease based on persistent ALT elevation or ICD codes for chronic non-specific or non-alcoholic liver disease (ICD-9: 571.5, 571.8, 571.9; ICD-10: K75.81, K76.0, K76.9). Persistent ALT elevation was defined as two or more instances of ALT ≥ 40 IU/L for men, or ≥ 31 IU/L for women, in the ambulatory setting, more than 6 months apart. Then, we excluded patients with viral hepatitis, alcoholic liver disease, or other chronic liver disease. These conditions were identified via ICD codes, as enumerated in the eMerge algorithm. Viral hepatitis cases were also identified using lab values (HBV surface antigen, HCV RNA). Next, we excluded patients on steatogenic medications (as defined in eMerge). Finally, patients were required to have evidence of hepatic steatosis on imaging, on biopsy, or in a clinical note. Such evidence was identified using natural language processing (NLP) to find mentions of hepatic steatosis and related terms.
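As an illustration of the persistent ALT elevation rule only, the check could be sketched as follows. This is a minimal sketch, not the study's code: the function name, the input layout (a list of date and value pairs already restricted to ambulatory results), and the reading of "more than 6 months" as more than 183 days are all assumptions.

```python
from datetime import date

# Sex-specific ALT thresholds quoted in the text (units as reported there).
ALT_THRESHOLD = {"M": 40, "F": 31}
MIN_GAP_DAYS = 183  # assumption: "more than 6 months" read as > 183 days


def has_persistent_alt_elevation(sex, alt_results):
    """Return True if two elevated ALT results lie more than 6 months apart.

    alt_results: iterable of (date, value) pairs, assumed ambulatory only.
    """
    threshold = ALT_THRESHOLD[sex]
    elevated_dates = sorted(d for d, value in alt_results if value >= threshold)
    if len(elevated_dates) < 2:
        return False
    # The widest gap is between the earliest and the latest elevated result.
    return (elevated_dates[-1] - elevated_dates[0]).days > MIN_GAP_DAYS
```

For example, a man with ALT values of 45 on 2015-01-01 and 52 on 2015-09-01 would satisfy the rule, while two elevated values two months apart would not.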

Natural language processing
The eMerge algorithm requires a mention of hepatic steatosis in a free-text document (imaging or biopsy result, or clinical note). We developed a tool to retrieve this information from the database, using the following steps:
• query the SQL database for documents containing any of these terms
• parse the documents to remove negative results (e.g. absence of steatohepatitis), occurrences in family history, and other false positive patterns
This process was adapted to look for mentions of deceased patients (see Section 2.4), to find patients with cirrhosis (see Section 2.6), and to gather MELD scores (see Table 1).
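The filtering step above can be sketched with simple rules. The term list and the negation and family cues below are hypothetical placeholders, since the source does not enumerate its patterns; a production tool would need a far richer rule set.

```python
import re

# Hypothetical search terms and cue patterns (not taken from the source).
STEATOSIS_TERMS = re.compile(
    r"\b(hepatic steatosis|fatty liver|steatohepatitis)\b", re.I
)
NEGATION_CUES = re.compile(r"\b(no|without|absence of|negative for)\b", re.I)
FAMILY_CUES = re.compile(r"\b(family history|mother|father|sibling)\b", re.I)


def mentions_steatosis(document):
    """Return True if at least one sentence affirms steatosis for the patient."""
    for sentence in re.split(r"(?<=[.;])\s+", document):
        match = STEATOSIS_TERMS.search(sentence)
        if not match:
            continue
        # Text preceding the term within the same sentence.
        prefix = sentence[: match.start()]
        if NEGATION_CUES.search(prefix):
            continue  # negated mention, e.g. "absence of steatohepatitis"
        if FAMILY_CUES.search(sentence):
            continue  # mention refers to a relative, not the patient
        return True
    return False
```

A rule this coarse will also discard sentences such as "No focal lesion, but fatty liver is present", which is one reason the false positive patterns need manual review.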

Data collection
The cohort for this study was created using the criteria defined in Section 2.1. The EMR data were obtained from the database of a large metropolitan hospital in New York City. We chose to consider only patients who met the criteria for NAFLD after December 31, 2012, up to January 31, 2019. We defined the NAFLD diagnosis date as the earliest such date for each patient.
We found 13,290 patients matching these criteria in the database. In the rest of this section, we describe, for each type of information, the data collection and pre-processing steps that were taken. In order to build a dataset usable by machine learning algorithms, we transformed the information contained in the database into binary features. When possible, we reduced the number of resulting features. Feature selection has been shown to improve the quality of results in machine learning applications. 18 This process is usually done using statistics- or heuristics-based algorithms. However, in practical applications, domain knowledge can be used instead. We took advantage of established knowledge to reduce the number of features by mapping to higher-level concepts, or by discarding infrequent features.
• Procedures used the Current Procedural Terminology (CPT) coding system. We mapped the CPT codes to their respective second-level group code. For example, the group containing all CPT codes from 33010 to 37799 describes surgeries of the cardiovascular system. This process grouped the codes into 115 categories that translated directly into features.
• Medication prescriptions or administrations. We mapped the medication names to the corresponding RxNorm drug concepts, and again kept those that occurred in at least 0.1% of the cohort. We only considered drugs which had at least two prescriptions separated by 6 months or more, in order to discard drugs only used acutely (e.g. post-surgery) which do not reflect a patient's regular medications.
Using this process, we obtained 293 clinical drugs.
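The medication filter described above can be sketched as follows. The row layout, the function name, the reading of "6 months" as 183 days, and the order of the two filters (chronic-use check first, then the 0.1% prevalence cut-off) are assumptions made for illustration; the source does not specify them.

```python
from collections import defaultdict
from datetime import date

MIN_GAP_DAYS = 183      # assumption: "6 months or more" read as >= 183 days
MIN_PREVALENCE = 0.001  # 0.1% of the cohort, as stated in the text


def chronic_drug_features(prescriptions, cohort_size):
    """Build {patient_id: set of drug concepts} from (patient, drug, date) rows."""
    dates = defaultdict(list)
    for patient_id, drug, day in prescriptions:
        dates[(patient_id, drug)].append(day)

    # Keep a drug for a patient only if the first and last orders
    # are at least MIN_GAP_DAYS apart (regular, not acute, use).
    patients_by_drug = defaultdict(set)
    for (patient_id, drug), days in dates.items():
        if (max(days) - min(days)).days >= MIN_GAP_DAYS:
            patients_by_drug[drug].add(patient_id)

    # Drop drugs that remain in fewer than 0.1% of the cohort.
    features = defaultdict(set)
    for drug, patients in patients_by_drug.items():
        if len(patients) / cohort_size >= MIN_PREVALENCE:
            for patient_id in patients:
                features[patient_id].add(drug)
    return dict(features)
```

Each drug concept retained this way becomes one binary feature per patient.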

Laboratory tests
As opposed to the previous data types, which were well-formatted and standardized, laboratory tests could be either qualitative or quantitative, and were often reported in free-text form. For qualitative tests, we parsed the result and searched for terms indicating that it was abnormal, such as "abnormal", "low", "below average", or "reactive".
For quantitative tests, we searched the results for numeric values that fell outside the normal range.
We obtained 533 distinct laboratory tests, which translated to as many binary features. For example, the feature "platelets" indicates an abnormal result for the platelets test. A shortcoming of this approach is that abnormally low and high values are grouped in the same feature, even though they have different medical significance. However, since one laboratory test can use different units, and thus different normal ranges (e.g. normal and log scales), automatically labeling a value as low or high cannot always be done reliably.
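A rough sketch of how a free-text laboratory result could be reduced to a single abnormal/normal flag is given below. The function name, the number-extraction rule, and the handling of the "non-" prefix are assumptions; only the four example terms come from the text.

```python
import re

# Terms quoted in the text as indicating an abnormal qualitative result.
ABNORMAL_TERMS = ("abnormal", "low", "below average", "reactive")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def is_abnormal(result_text, normal_range=None):
    """Return True when a free-text lab result looks abnormal.

    normal_range: optional (low, high) tuple for quantitative tests.
    """
    text = result_text.strip().lower()
    number = NUMBER.search(text)
    if normal_range is not None and number:
        value = float(number.group())
        low, high = normal_range
        # Low and high values share one flag, as in the approach described.
        return value < low or value > high
    # Qualitative branch: whole-term match, skipping "non-" / "non " prefixes.
    # Other negations ("not reactive") are not handled by this sketch.
    return any(
        re.search(rf"(?<!non-)(?<!non )\b{re.escape(term)}\b", text)
        for term in ABNORMAL_TERMS
    )
```

For instance, `is_abnormal("95 x10E3/uL", normal_range=(150, 450))` flags a low platelet count, but the flag alone does not record that the value was low rather than high.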

Vital signs
As with laboratory tests, we searched for abnormal values for the standard vital signs collected in clinical settings, using the following criteria:
• body temperature: > 39°C (Celsius) or 102°F (Fahrenheit).

Patient pairwise distance and clustering
In order to identify different subtypes, we computed the patient distance matrix and applied an unsupervised clustering algorithm to it. Unsupervised clustering is well-suited for exploratory tasks in applied research, 20 for two reasons. First, the results can be validated using expert knowledge. In the present study, the findings were reviewed and interpreted by medical experts. Second, the "unsupervised" aspect allows discovery of new, potentially unexpected insights from the analysis of a large number of features.
Many clustering algorithms have been developed. Finding the "best one" remains an open problem, 21 since unsupervised learning tasks lack objective measures to assess their performance. Several measures have been proposed to evaluate the quality of a set of clusters, 22 but the general guideline is that the best algorithm and parameters are different for each data set.
We chose a hierarchical clustering algorithm using the Manhattan distance for pairwise similarity of patients, with a linkage criterion that minimizes the increase in variance during cluster merging (also known as Ward's criterion). Hierarchical clustering is a standard algorithm, and it has been used previously in a study looking for comorbidity clusters in autism disorders. 23 We used the R hclust implementation of this algorithm, with ward.D2 as the agglomeration method. 24 We chose to have 5 subtypes (clusters) as a balance between granularity and size. These parameters were chosen empirically, after qualitative validation of the results obtained with various combinations.
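To make the procedure concrete, the following is a small from-scratch sketch of agglomerative clustering on Manhattan distances with a Ward-type Lance-Williams update on squared dissimilarities. It is intended only to illustrate the idea on toy data: the study itself used R's hclust, and the function names, tie-breaking, and cubic-time loop here are assumptions of this sketch.

```python
from itertools import combinations
from math import sqrt


def manhattan(a, b):
    """Manhattan distance; on binary features it counts differing features."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ward_clusters(rows, n_clusters):
    """Naive agglomerative clustering with a Ward-type update.

    Returns a sorted list of clusters, each a sorted list of row indices.
    Suitable for small inputs only.
    """
    members = {i: [i] for i in range(len(rows))}
    dist = {
        frozenset(p): manhattan(rows[p[0]], rows[p[1]])
        for p in combinations(members, 2)
    }
    next_id = len(rows)
    while len(members) > n_clusters:
        pair = min(dist, key=dist.get)  # closest pair of current clusters
        a, b = tuple(pair)
        na, nb = len(members[a]), len(members[b])
        d_ab = dist.pop(pair)
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            total = na + nb + nc
            # Lance-Williams update for Ward, applied to squared dissimilarities.
            dist[frozenset((next_id, c))] = sqrt(
                ((na + nc) * d_ac ** 2 + (nb + nc) * d_bc ** 2 - nc * d_ab ** 2)
                / total
            )
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return sorted(sorted(m) for m in members.values())
```

Stopping the merges at `n_clusters=5` corresponds to cutting the dendrogram into five subtypes.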

Statistical analysis
2.6.1. Descriptive statistics
Categorical features were summarized as proportions and compared using the chi-squared test. Continuous features were summarized as means ± standard deviation and compared using ANOVA, or as medians and interquartile ranges and compared using the Wilcoxon rank-sum test. Comparisons for each subtype were made against patients in all remaining subtypes. Significance was defined as a false discovery rate < 0.001.
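The text does not name the procedure used to control the false discovery rate. Assuming the common Benjamini-Hochberg adjustment (an assumption of this sketch, not a statement about the study), the significance rule could look like this:

```python
FDR_THRESHOLD = 0.001  # significance cut-off stated in the text


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_features(p_values):
    """Flag each test whose adjusted p-value falls below the threshold."""
    return [q < FDR_THRESHOLD for q in benjamini_hochberg(p_values)]
```

With 1,145 features compared across subtypes, an adjustment of this kind keeps the expected share of false positives among flagged features below the chosen threshold.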

Survival analysis
The primary outcome was overall survival. Secondary outcomes were HCC, cirrhosis, CKD, CVD, and acute myocardial infarction (MI). In all cases survival was defined as the time from NAFLD diagnosis to the earliest evidence of the outcome. HCC cases were first identified using ICD codes (ICD-9 155.0, 155.2; ICD-10 C22.0, C22.7-C22.9), then confirmed through chart review. Cirrhosis was defined using natural language processing to look for mentions of cirrhosis in clinical notes, imaging reports or biopsy reports. Chronic kidney disease was defined using corresponding ICD codes (ICD-9 585-586; ICD-10 N18-N19) and CPT codes for dialysis (90935 to 90999). Cardiovascular disease was defined using ICD codes for any ischemic heart disease (ICD-9 410-414; ICD-10 I20-I25). Acute MI was a subset of the CVD outcome (ICD-9 410; ICD-10 I21-I22).
The primary predictor in survival analyses was subtype. Secondary predictors included age, gender, race and FIB-4 category. Race and ethnicity were combined for the purposes of this analysis, with Hispanic ethnicity given precedence and mapped to the Hispanic race category. All survival analyses were done in R 3.6.0. For the outcome of overall survival, Kaplan-Meier curves were created using the ggplot2 25 and survminer 26 packages; univariate and multivariate Cox proportional hazards models were constructed using the survival package. 27 For non-death outcomes, only incident cases were included in the analysis. Cases diagnosed prior to or within 6 months of NAFLD diagnosis were treated as prevalent. Death was treated as a competing risk. The cumulative incidence function was calculated for each outcome using the cmprsk package 28 and plotted using ggplot2. The cmprsk package was also used to fit univariate and multivariate Fine-Gray proportional subdistribution hazards regression models for the non-death outcomes.
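For readers unfamiliar with the Kaplan-Meier estimate used for overall survival, a from-scratch sketch of the product-limit calculation is shown below. The study relied on R packages for this step; the function name and input layout here are assumptions, and the sketch omits confidence intervals and the competing-risk analysis.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times: follow-up time from NAFLD diagnosis (any consistent unit).
    events: 1 if the patient died at that time, 0 if censored.
    """
    survival = 1.0
    curve = []
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Censored patients contribute to the number at risk until their last follow-up, which is why they lower later denominators without producing a step in the curve.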
This study was reviewed and approved by the Mount Sinai Hospital institutional review board (GCO 10-0032 and 16-1437).

Descriptive statistics for the cohort
Merging the data from the different sources described above, we obtained a data set containing 13,290 patients with NAFLD, described by 1,145 binary features (

Identification of NAFLD subtypes
The two largest subtypes (1 and 3) encompassed 87% of patients, while the remaining patients were divided among three smaller subtypes ( vs 3.6%). Medications more commonly prescribed in this subtype included cardiac medications such as aspirin, lisinopril, amlodipine, metoprolol, and atorvastatin; diabetes medications such as metformin and insulin; pain medications such as acetaminophen, gabapentin, oxycodone, and morphine; respiratory medications such as albuterol and fluticasone; antacid medications such as omeprazole and famotidine; and vitamin D. Subtype 2 patients were also more likely to have had digestive surgery (40.1% vs. 16.8%).
Overall, subtype 2 patients had metabolic syndrome with signs of developing liver dysfunction and were high healthcare utilizers.
Patients in subtype 3 tended to be younger and Caucasian, and had the fewest inpatient admissions and the fewest prescriptions on average. Subtype 3 patients had fewer comorbidities than other patients, and were unlikely to have abnormal lab values associated with liver dysfunction. Subtype 3 patients were relatively healthy compared to the rest of the cohort.
Patients in subtype 4 were more likely to be older, male and Caucasian. They had high FIB-4 scores at baseline and were likely to have abnormal labs suggesting liver synthetic dysfunction. These patients were less likely to be obese or to have hyperlipidemia (20.8% vs 28.7%), though diabetes and hypertension were common. Overall, subtype 4 patients likely had liver fibrosis at baseline and had labs suggesting progression to cirrhosis.

Identification of distinct outcomes by NAFLD subtype
Univariate analyses showed that risk of outcomes varied by subtype membership (Figures 1 and 2). In multivariate analyses accounting for age, gender, race and baseline FIB-4, subtype membership remained an independent predictor of outcomes (Figure 3). With subtype 1 as the reference, subtype 5 was independently associated with the highest risks for death (

Internal cross-validation of the subtypes discovered
Formal validation of the results is inherently complicated for unsupervised clustering, where no "true label" exists for any patient. In order to assess the robustness of our results, we performed internal cross-validation on our dataset, as we have no access to EMR data from other medical centers. We randomly selected 90% of the samples, ran the clustering process on this subset, and repeated the process 10 times. In these runs, we identified enriched clinical features and disease comorbidities similar to those of the subtypes discovered previously. The full results are reported in supplementary table 1, hosted at https://github.com/mv50/psb20_mat.
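The resampling scheme described above can be sketched as a generator of patient index subsets. The function name, the fixed seed, and the rounding of the subset size are assumptions for illustration; each subset would then be passed through the same clustering pipeline.

```python
import random


def subsample_indices(n_patients, fraction=0.9, repeats=10, seed=0):
    """Yield sorted index lists, each a random `fraction` of the cohort."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    size = round(n_patients * fraction)
    for _ in range(repeats):
        # Sampling without replacement within each repeat.
        yield sorted(rng.sample(range(n_patients), size))
```

For the cohort of 13,290 patients, each of the 10 subsets contains 11,961 patients.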

Conclusion
In this study, we combined two existing signatures of NAFLD and used them to gather a cohort of 13,290 patients with confirmed NAFLD. We used unsupervised clustering to identify five subtypes of patients. These subtypes had different clinical characteristics and different outcomes: the two larger groups had fewer comorbidities and better outcomes, while a minority of the cohort (in the three smaller subtypes) had more serious comorbidities and worse outcomes. To our knowledge, this study is the first to use an artificial intelligence approach to delineate clinically relevant subtypes of NAFLD.
Our findings are consistent with prior studies reporting higher rates of NAFLD among Hispanic patients. 14 In addition, the subtypes reveal that Hispanic patients with NAFLD are on a continuum of risk, with some exhibiting the metabolic syndrome but having good outcomes (subtype 1), others experiencing predominantly non-liver adverse outcomes (subtype 2) and some with severe liver disease and at risk for multiple adverse outcomes (subtype 5).
Our study of heterogeneity among NAFLD patients was strengthened by the diverse patient population within Mount Sinai's catchment area and the comprehensive use of EMR data. We gathered data from various sources to build the features: vital signs, diagnoses, procedures, prescriptions, laboratory results, radiology and pathology reports. Our approach is generalizable and could be applied by local or regional healthcare systems to define disease subtypes within their own patient populations. Such efforts could help guide resource allocation at the local level, in contrast to national or international guidelines which may not be relevant to all localities and patient populations.
The limitations of our study are common to EMR-based projects. ICD codes are prone to miscoding and may not accurately represent a patient's medical condition. We used phecodes to map ICD codes to higher-level disease concepts in order to improve power and simplify instances where there are multiple related ICD codes. The pre-processing and cleaning of the data remain open to improvement. Additionally, more systematic incorporation of data from unstructured clinical notes could bring valuable new information.
In conclusion, we defined an EMR-based algorithm for identifying NAFLD patients and showed that unsupervised clustering can be used to identify clinically relevant disease subtypes with distinct patterns of adverse outcomes. If prospectively validated, these disease subtypes could help guide patient management and screening initiatives.